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Abstract We propose extensions to FORTRAN which integrate forward and reverse 
QQ ■ Automatic Differentiation (AD) directly into the programming model. Irrespective 

of implementation technology, embedding AD constructs directly into the language 
extends the reach and convenience of AD while allowing abstraction of concepts 
of interest to scientific-computing practice, such as root finding, optimization, and 
finding equilibria of continuous games. Multiple different subprograms for these 
Y^ \ tasks can share common interfaces, regardless of whether and how they use AD in- 

ternally. A programmer can maximize a function f by calling a Hbrary maximizer, 
XSTAR=ARGMAX (F, xo ) , which internally constructs derivatives of f by AD, with- 
out having to learn how to use any particular AD tool. We illustrate the utility of 
these extensions by example: programs become much more concise and closer to 
^+ ■ traditional mathematical notation. A companion paper describes how these exten- 

Tlj- I sions can be implemented by a program that generates input to existing FORTRAN- 

based AD tools. 

^^ ■ Key words: Nesting, multiple transformation, forward mode, reverse mode, Tape- 

Cn I NADE, Adifor, programming-language design 
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1 Introduction 

The invention of FORTRAN was a major advance for numeric computing, allowing 
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to be transcribed into a natural but unambiguous notation 



FUNCTION G(X, ALPHA, BETA) 

G=BETA**ALPHA/GAMMA (ALPHA) *X** (ALPHA-1) *EXP (-BETA*X) 

END 



which could be automatically translated into an executable program. However, tran- 
scribing 

Xi+l ^Xi-f{xi)/f'{xi) 



to Fortran in 






FUNCTION RAPHSN(F, FPRIME, 


XO, 


N) 


EXTERNAL F, FPRIME 






X = XO 






DO 1690 1=1, N 






1690 X = X-F(X) /FPRIME (X) 






RAPHSN = X 






END 







requires that the caller provide both f and fprime. Manually coding the latter from 
the former is, in most cases, a mechanical process, but tedious and error prone. 

This problem has traditionally been addressed by arranging for an AD prepro- 
cessor to produce FPRIME (Speelpenning, 1980; Wengert, 1964). That breakthrough 
technology not only relieves the programmer of the burden of mechanical coding of 
derivative-calculation codes, it also allows the derivative code to be updated auto- 
matically, ensuring consistency and correctness. However, this caller derives disci- 
pline has several practical difficulties. First, the user must learn how to use the AD 
preprocessor, which constitutes a surprisingly serious barrier to adoption. Second, 
it makes it very difficult to experiment with the use of different sorts of derivatives 
(e.g., adding a Hessian- vector product step in an optimization) in such called subpro- 
grams, or to experiment with different AD preprocessors. Third, although prepro- 
cessors might be able to process code which has already been processed in order to 
implement nested derivatives, the maneuvers required by current tools can be some- 
what arcane (Siskind and Pearlmutter, 2008a). Fourth, software engineering princi- 
ples of locality and atomicity are being violated: knowledge of what derivatives are 
needed is distributed in a number of locations which must be kept consistent; and 
redundant information, which must also be kept consistent, is being passed, often 
down a long call chain. We attempt to solve these problems, making the use of AD 
more concise, convenient, and intuitive to the scientific programmer, while keep- 
ing to the spirit of FORTRAN. This is done using the Forward And Reverse Fortran 
Extension Language or Farfel, a small set of extensions to FORTRAN, in concert 
with an implementation strategy which leverages existing FORTRAN compilers and 
AD preprocessors (Bischof et al., 1992; Hascoet and Pascual, 2004). 

The remainder of the paper is organized as follows: Section 2 describes Farfel. 
Section 3 describes a concrete example FARFEL program to both motivate and illu- 
minate the proposed extensions. Section 4 situates this work in its broader context. 
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Section 5 summarizes this work's contributions. A companion paper (Radul et al., 
2012) describes how Farfel can be implemented by generating input to existing 
AD tools. 



2 Language Extensions 

Farfel consists of two principal extensions to Fortran: syntax for AD and for 
nested subprograms. We currently support only FORTRAN??, but there is no barrier, 
in principle, to adding FARFEL to more recent dialects. 

Extension 1: AD Syntax 

Traditional mathematical notation allows one to specify 

/ d / 1 1 /x — i\2 

^ ^dalv27ra^^'^^"2V~5 

By analogy, we extend FORTRAN to encode this as 



ADF( TANGENT (SIGMA) = 1) 

PHI = 1/SQRT (2*PI*SIGMA**2) *EXP (-0.5* ( (X-XBAR) /SIGMA) **2) 

END ADF(PHIPRM = TANGENT(PHI)) 



which computes the derivative phiprm of phi with respect to sigma by forward AD. 
For syntactic details see companion paper (Radul et al., 2012). 

An analogous FARFEL construct supports computing the same derivative with 
reverse AD: 



ADR (COTANGENT (PHI) = 1) 

PHI = 1/SQRT (2*PI*SIGMA**2) *EXP (-0.5* ( (X-XBAR) /SIGMA) **2) 
END ADR(PHIPRM = COTANGENT ( S I GMA ) ) 



Note that with the adr construct, the dependent variable appears at the beginning of 
the block and the independent variable at the end — the variables and assignments in 
the opening and closing statements specify the desired inputs to and outputs from 
the reverse phase, whereas the statements inside the block give the forward phase. 

These constructs allow not just convenient expression of AD, but also modularity 
and encapsulation of code which employs AD. For instance, we can write a general 
scalar-derivative subprogram derivi at user level 



FUNCTION DERIVI (F, X) 

EXTERNAL F 

ADF(X) 

Y = F (X) 

END ADF(DERIV1 = TANGENT(Y)) 

END 



which could be used in, for example. 



FUNCTION PHI (SIGMA) 

PHI = 1/SQRT (2*PI*SIGMA**2) *EXP (-0.5* ( (X-XBAR) /SIGMA) **2) 

END 
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I PHIPRM = DERIV1(PHI, SIGMA) 

DERivi can be changed to use reverse AD without changing its API: 



FUNCTION DERIVI (F, X) 

EXTERNAL F 

ADR(Y) 

Y = F (X) 

END ADR(DERIV1 = COTANGENT ( X ) ) 

END 



allowing codes written with derivi to readily switch between using forward and 
reverse AD. 

To take a more elaborate example, we can write a general gradient calculation 
GRAD using repeated forward AD: 





SUBROUTINE GRAD (F, X, N, OX) 








EXTERNAL F 








DO 1492 1=1, N 








ADF (TANGENT (X (J) ) = 1-MINO ( lABS ( I- J) 


1), 


J=1,N) 




Y = F(X) 






1492 


END ADF(DX(I) = TANGENT(Y)) 
END 







(Note that the adf and adr constructs support implied-DO syntax for arrays.) 
This can be modified to instead use reverse AD without changing the API: 



SUBROUTINE GRAD (F, X, N, DX) 




EXTERNAL F 




ADR(Y) 




Y = F(X) 




END ADR(DX(I) = COTANGENT ( X ( I )) , 


1=1, N) 


END 





Although not intended to support checkpoint-reverse AD, our constructs are suffi- 
ciently powerful to express a reverse checkpoint: 



c 


CHECKPOINT REVERSE F->G. BOTH 1ST ARC IN, 

CALL F(X, Y) 

ADR(COTANGENT(Z (I) ) = ..., 1 = 1, NZ) 

CALL G(Y, Z) 

ENDADR(DY(I) = COTANGENT ( Y ( I )) , 1 = 1, NY) 

ADR(COTANGENT(Y (I) ) = DY ( I ) , 1=1, NY) 

CALL F(X, Y) 

ENDADR(DX(I) = COTANGENT ( X ( I )) , 1 = 1, NX) 


2ND ARG OUT 



This sort of encapsulation empowers numeric programmers to conveniently ex- 
periment with the choice of differentiation method, or with the use of various sorts 
of derivatives, including higher-order derivatives, without tedious modification of 
lengthy call chains. 

Extension 2: Nested Subprograms 

We borrow from Algol 60 (Backus et al., 1963) and generalize the Fortran 
"statement function" construct by allowing subprograms to be defined inside other 
subprograms, with lexical scope. 

For example, given a univariate maximizer argmax, we can express the idea of 
line search as follows: 
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c 


MAXIMIZE F ALONG THE LINE PARALLEL TO XDIR THROUGH X 




SUBROUTINE LI NMAX (F, X, XDIR, LENX, N, XOUT) 




EXTERNAL F 




DIMENSION Y(50) 




FUNCTION ALINE (DIST) 




DO 2012 1=1, LENX 


2012 


Y(I) = X (I) +DIST*XDIR(I) 




ALINE = F(Y) 




END 




BESTD = ARGMAX (ALINE, 0.0, N) 




DO 2013 1=1, LENX 


2013 


XOUT(I) = X(I)+BESTD*XDIR(I) 




END 



Here we are using a library univariate maximizer to maximize the univariate func- 
tion ALINE, which maps the distance along the given direction to the value of our 
multidimensional function of interest f at that point. Note that aline refers to vari- 
ables defined in its enclosing scope, namely f, x, xdir, lenx, and y. Note that if 
ARGMAX uses derivative information, AD will be performed automatically on aline. 



3 Concrete Example 

We employ a concrete example to show the convenience of the above constructs. We 
will also illustrate the implementation on this example in (Radul et al., 2012). Let 
two companies, Apple and Banana, be engaged in competition in a common fashion 
accessories market. Each chooses a quantity of their respective good to produce, and 
sells all produced units at a price determined by consumer demand. Let us model 
the goods as being distinct, but partial substitutes, so that availability of products of 
A decreases demand for products of B and vice versa (though perhaps not the same 
amount). We model both companies as having market power, so the price each gets 
will depend on both their own output and their competitor's. Each company faces 
(different) production costs and seeks to maximize its profit, so we can model this 
situation as a two player game. Let us further assume that the quantities involved 
are large enough that discretization effects can be disregarded. 

An equilibrium {a*,b*) of a two-player game with continuous scalar strategies a 
andb andpayoff functions A (a, fe) andB{a,b) must satisfy a system of equations: 

fl* = argmaxA(a,/7*) b* =aigmaxB{a*,b) (1) 

a b 

Equilibria can be sought by finding roots of 

a* = argmaxA(a,argmaxB(a*,fe)) (2) 

a b 

which is the technique we shall employ. ' Translated into computer code in the most 
natural way, solving this equation involves a call to an optimization subprogram 
within the function passed to an optimization subprogram, itself within the function 
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passed to a root-finding subprogram. If said optimization and root-finding subpro- 
grams need derivative information, this gives rise to deeply nested AD. 

Note that in (2), the payoff function B is bivariate but argmax takes a univariate 
(in the variable of maximization) objective function. The a* variable passed to B is 
free in the innermost argmax expression. Free variables occur naturally in mathe- 
matical notation, and we support them by allowing nested subprogram definitions. 

We can use our extensions to code finding the roots of (2) in a natural style: 



ASTAR & BSTAR: GUESSES IN, OPTIMIZED VALUES OUT 
SUBROUTINE EQLBRM (BIGA, HIGH, ASTAR, BSTAR, N) 
EXTERNAL BIGA, BIGB 
FUNCTION F (ASTAR) 
FimCTION G(A) 
FUNCTION H (B) 
H = BIGB (ASTAR, B) 
END 
BSTAR = ARGMAX (H, BSTAR, N) 
G = BIGA (A, BSTAR) 
END 
F = ARGMAX (G, ASTAR, N) -ASTAR 
END 
ASTAR = ROOT(F, ASTAR, N) 
END 



where we implement just the minimal cores of one-dimensional optimization and 
root finding to illustrate the essential point — root finding by the Rhapson method: 





FUNCTION ROOT (F, XO , N) 




X = XO 




DO 1669 1=1, N 




CALL DERIV2 (F, X, Y, YPRIME) 


1669 


X = X-Y/YPRIME 




ROOT = X 




END 


' 


SUBROUTINE DERIV2 (F, X, Y, YPRIME) 




EXTERNAL F 




ADF(X) 




Y = F(X) 




END ADF (YPRIME = TANGENT(Y)) 




END 



and optimization by finding the root of the derivative: 



FUNCTION 


ARGMAX (F, XO 


, N) 




FUNCTION FPRIME (X) 






FPRIME 


= DERIVl (F, 


X) 




END 








ARGMAX = 


ROOT (FPRIME, 


XO, 


N) 


END 









' The existence or uniqueness of an equilibrium is not in general guaranteed, but our pailicular A 
and B have a unique equilibrium. Coordinate descent (alternating optimization of a* and b*) would 
require less nesting, but has inferior convergence properties. Although this example involves AD 
through iterative processes, we do not address that issue in this work: it is beyond the scope of this 
paper, and used here only in a benign fashion, for vividness. 
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On our concrete objective functions these converge rapidly, so for clarity we skip 
the clutter of convergence detection. 

This strategy impels us to compute derivatives nested five deep, in a more com- 
plicated pattern than just a fifth-order derivative of a single function. This under- 
taking is nontrivial with current AD tools (Siskind and Pearlmutter, 2008a), but 
becomes straightforward with the proposed extensions — embedded AD syntax and 
nested subprograms make it straightforward to code sophisticated methods that re- 
quire complex patterns of derivative information. 



4 Discussion 

The Farfel AD extensions hew to the spirit of FORTRAN, which tends to prefer 
blocks rather than higher-order operators for semantic constructs. (In this, these ex- 
tensions are syntactically quite similar to a set of AD extensions integrated with 
the NAGware FORTRAN 95 compiler (Naumann and Riehme, 2005), albeit quite 
different semantically. Unfortunately those NAGware extensions are no longer pub- 
licly available, and the limited available documentation is insufficient to allow a 
detailed comparison.) A reasonably straightforward implementation technology in- 
volves changing transformed blocks into subprograms that capture their lexical 
variable context and closure-converting these into top-level subprograms, render- 
ing them amenable to processing with existing tools (Radul et al., 2012). Since 
the machinery for nested subprograms is present, allowing them imposes little addi- 
tional implementation effort. Moreover, as seen in the example above, that extension 
makes code that involves heavy use of higher-order functions, which is encouraged 
by the availability of the AD constructs, more straightforward. In this sense AD 
blocks and nested subprograms interact synergistically. 

These new constructs are quite expressive, but this very expressiveness can tax 
many implementations, which might not support some combinations or usages. For 
instance, code which makes resolution of the AD at compile time impossible (an 
«-th derivative subprogram, say) would be impossible to support without a dynamic 
run-time AD mechanism. This would typically not be available. Another common 
restriction would be that many tools do not support reverse mode at all and even 
those that do typically do not allow nesting over reverse mode, either reverse-over- 
reverse or forward-over-reverse. It is the responsibility of the implementation to 
reject such programs with a cogent error 

The Farfel extensions are implemented by the Farfallen preprocessor (Radul 
et al., 2012), which generates input for and invokes existing AD tools. This lever- 
ages existing AD systems to provide the differentiation functionality in a uniform 
and integrated way, extending the reach of AD by making its use easier, more natu- 
ral, and more widely applicable. 

Such a prepreprocessor can target different AD systems (like Adifor (Bischof 
et al., 1992) and Tapenade (Hascoet and Pascual, 2004)), allowing easy porting of 
code from one AD system to another It could even mix AD systems, for example 
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using Tapenade to reverse-transform code generated by using Adifor in forward 
mode, capturing their respective advantages for the application at hand. The effort of 
implementing such retargetings and mixings could then be factored to one developer 
(of the prepreprocessor) instead of many end users of AD. 

A more important benefit of extending FORTRAN with AD syntax and nested 
subprograms is that a host of notions become reusable abstractions — not just first- 
order derivatives, but also their variations, combinations, and uses, e.g., Jacobians, 
Hessians, Hessian-vector products, filters, edge detectors, Fourier transforms, con- 
volutions, Hamiltonians, optimizations, integration, differential-equation solvers. 
The interfaces to different methods for these tasks can be made much more uniform 
because, as our argmax did, they can accept subprograms that just accept the vari- 
ables of interest (in this case, the argument of maximization) and take any needed 
side information from their lexical scope; and subprograms such as argmax can ob- 
tain any derivative information they wish from AD without having to demand it be 
passed in as arguments. So different maximization methods can be tried out on the 
same objective function with ease, regardless of how much derivative information 
they require; and at the same time, different objective functions, that carry different 
side information, can be maximized by the same maximization subprogram without 
having to adjust it to transmit the needed side information. Essentially, derivatives 
are requested where they are needed, and the implementation does the necessary 
bookkeeping. 

These modularity benefits are illustrated by our example program: the Farfel 
input is only 64 lines of code, whereas the amount of code it expands into, which 
is comparable to what would need to be written by hand without these extensions, 
weighs in at a much more substantial 160 for TAPENADE and 315 for Adifor, 
including the configuration data needed to run the AD preprocessors to produce 
the needed derivatives. Manually performing the 5 nested applications of AD this 
example calls for is a tedious, error prone, multi-hour effort, which must be un- 
dertaken separately for each preprocessor one wishes to try.^ Existing AD tools do 
already save the major labor of manually implementing derivative and gradient sub- 
programs, and keeping them in sync with the subprograms being differentiated. The 
further preprocessing step outlined above leverages these tools into being even more 
useful. For larger programs, the savings of implementation and maintenance effort 
would be considerable. 

The present suggestion is not, of course, limited to FORTRAN. Nested subpro- 
grams have gained wide adoption in programming-language designs from AL- 
GOL 60 and beyond, and have yielded proven gains in programmer productivity. 
Their advantages for code expressiveness have led to functions with lexical scope 
being used as a mathematical formalism for reasoning about computing (Church, 
1941), to programming languages organized around the function as the primary 
program construct (Jones, 2003), and to compilers that specialize in the efficient 
representation and use of functions (Jones et al., 1993; Steele, 1978). 



^ A detailed step-by-step discussion of the transfomiation of this example along with all interme- 
diate code is available at http : //www .bcl .hamilton . ie/~ qobi/ fort ran/. 
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Fig. 1 Performance com- 
parison. Smaller is faster. 
Numeric solution of (2) 
with above Farfel code, 
A' = 1000 iterations at each 
level, Farfallen targeting 
two FORTRAN-based AD 
tools; for comparison, the 
same computation is coded 
in VLAD (Pearlmutter and 
Siskind, 2008) and compiled 
with Stalingrad (Siskind 
and Pearlmutter, 2008b). 
Computer: Intel 17 870 @ 
2.93GHz, Gfortran 4.6.2- 
9, 64-bit Debian sid, -Of ast 
-fwhole-program, single 
precision. See (Radul et al., 
2012) for details. 



CPU Time (seconds) 




Tapenade 



Adifor Stalingrad 



One can also add adf- and ADR-like constructs to other languages that have pre- 
processor implementations of AD, for example, C and ADIC (Bischof et al., 1997). 
One would not even need to add nested subprograms in the preprocessor, because 
GCC already implements them for GNU C. Doing so would expand the convenience 
(and therefore reach) of existing AD technology even further. 

Retrofitting AD onto existing languages by preprocessing is not without its limi- 
tations, however. Efficient AD preprocessors must construct a call graph in order to 
determine which subprograms to transform, along with a variety of other tasks al- 
ready performed by the compiler. Moreover, optimizing compilers cannot be relied 
upon to aggressively optimize intricate machine-generated code, as such code often 
exceeds heuristic cutoffs in various optimization transformations. This imposes a 
surprisingly serious limitation on AD preprocessors. (Together, these also imply 
a significant duplication of effort, while providing room for semantic disagree- 
ments between AD preprocessors and compilers which can lead to subtle bugs.) 
This leads us to anticipate considerable performance gains from designing an opti- 
mizing compiler with integrated AD. Indeed, translating our concrete example into 
VLAD (Pearlmutter and Siskind, 2008) and compiling with STALINGRAD (Siskind 
and Pearlmutter, 2008b), our prototype AD-enabled compiler, justifies that suspi- 
cion (see Fig. 1). We therefore plan to make a VLAD back-end available in version 
2 of Farfallen. 



5 Conclusion 



We have defined and motivated extensions to FORTRAN for convenient, modular 
programming using automatic differentiation. The extensions can be implemented 
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as a prepreprocessor (Radul et al., 2012). This strategy enables modular, flexible 
use of AD in the context of an existing legacy language and tool chain, without 
sacrificing the desirable performance characteristics of these tools: only about 20%- 
50% slower than a dedicated AD-enabled compiler, depending on which FORTRAN 
AD system is used. 
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